Translating User-Generated Content in the Social Networking Space
Abstract
This paper presents a case study of work done by Applied Language Solutions (ALS) for a large social networking provider that claims to have built the world’s first multi-language social network, where Internet users from all over the world can communicate in any of the languages available in the system. In an initial phase, the social networking provider contracted ALS to build Machine Translation (MT) engines for twelve language pairs: Russian⇔English, Russian⇔Turkish, Russian⇔Arabic, Turkish⇔English, Turkish⇔Arabic and Arabic⇔English. All of the input data is user-generated content, so we faced a number of problems in building large-scale, robust, high-quality engines. Primarily, much of the source-language data is of ‘poor’ or at least ‘non-standard’ quality. This comes in many forms: (i) content produced by non-native speakers, (ii) content produced by native speakers containing non-deliberate typos, or (iii) content produced by native speakers which deliberately departs from spelling norms to bring about some linguistic effect. Accordingly, in addition to the ‘regular’ preprocessing techniques used in building our statistical MT systems, we needed to develop routines to deal with all these scenarios. In this paper, we describe how we handle short forms, acronyms, typos, punctuation errors, non-dictionary slang, wordplay, censor avoidance and emoticons. We present automatic evaluation scores on the social network data, together with insights from the social networking provider regarding some of the typical errors made by the MT engines, and how we managed to correct these in the engines.
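The abstract names the kinds of noise handled but not the routines themselves. As a rough illustration only, the sketch below shows what a pre-translation normalization step for short forms, repeated characters, repeated punctuation and emoticons might look like. The `SHORT_FORMS` table, the regular expressions and the `normalize` function are all hypothetical and are not taken from the ALS systems.

```python
import re

# Hypothetical lookup table (an assumption for illustration); a real system
# would need separate, much larger resources for each source language.
SHORT_FORMS = {"u": "you", "gr8": "great", "thx": "thanks", "pls": "please"}

# Matches a few common Western-style emoticons such as :) ;-) :D :(
EMOTICON = re.compile(r"[:;=][-^]?[)(DPp]")

# Three or more repeats of the same character, e.g. "sooooo" or "!!!"
REPEATS = re.compile(r"(.)\1{2,}")

# Splits a token into a word part and any trailing punctuation
WORD_PUNCT = re.compile(r"(.*?)([.!?,]*)")


def normalize(text: str) -> str:
    """Normalize one user-generated message before it is sent to an MT engine."""
    tokens = []
    for token in text.split():
        if EMOTICON.fullmatch(token):
            tokens.append(token)  # pass emoticons through untouched
            continue
        match = WORD_PUNCT.fullmatch(token)
        word, punct = match.group(1), match.group(2)
        word = REPEATS.sub(r"\1\1", word)  # cap letter runs at two: "sooooo" -> "soo"
        punct = REPEATS.sub(r"\1", punct)  # collapse runs of 3+ punctuation: "!!!" -> "!"
        word = SHORT_FORMS.get(word.lower(), word)  # expand known short forms
        tokens.append(word + punct)
    return " ".join(tokens)


print(normalize("thx u r gr8!!! :)"))
```

A dictionary lookup of this kind cannot tell a deliberate departure from spelling norms from an accidental typo, which is one reason the distinction drawn in the abstract matters in practice.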
Related Resources
Implications of User Generated Content on Facebook
The purpose of this study is to examine the implications (user benefits and costs) of user generated content posted by users on Facebook to individual users. Although motivations to use social networking sites are widely researched and published, studies on implications of information on social networking sites are sparse. Hence, this study addresses this gap by an interpretive analysis of user ...
Mining User Profiles to Support Structure and Explanation in Open Social Networking
The proliferation of media sharing and social networking websites has brought with it vast collections of site-specific user generated content. The result is a Social Networking Divide in which the concepts and structure common across different sites are hidden. The knowledge and structures from one social site are not adequately exploited to provide new information and resources to the same or...
Automatic Hashtag Recommendation in Social Networking and Microblogging Platforms Using a Knowledge-Intensive Content-based Approach
In social networking/microblogging environments, #tag is often used for categorizing messages and marking their key points. Also, since some social networks such as Twitter apply restrictions on the number of characters in messages, #tags can serve as a useful tool for helping users express their messages. In this paper, a new knowledge-intensive content-based #tag recommendation system is intr...
Social Networking Websites - A Concatenation of Impersonation, Denigration, Sexual Aggressive Solicitation, Cyber-Bullying or Happy Slapping Videos
Hands-off legislation, toothless policy statements, unknowing parents, uncaring participants, and unwilling social network intermediaries (SNIs), have conspired to invite impersonation, denigration, sexual or aggressive solicitation, cyber-bullying, and happy slapping to the members of most social networking websites (SNWs). The situation is serious because the user-generated content (U...
Your Members Are Also Your Customers: Marketing for Internet Social Networks
Perhaps the fastest growing arena in the World Wide Web is the space of social networking sites (e.g., Friendster, Facebook, MySpace). The success of these sites directly depends on the number and activity level of their users. What attracts users to the site is a continually changing digital content (e.g., messages, pictures, photos, music, videos, blogs) generated by other users. In contrast,...
Publication date: 2012